Phoenix Evals runs LLM-as-judge evaluations over traces or datasets. Pre-built templates: hallucination, relevance, toxicity, QA correctness. Outputs scored pandas DataFrames.
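A minimal sketch of the classification-style eval flow, assuming the `arize-phoenix-evals` package and an OpenAI API key are available; the sample rows and the judge model name are illustrative, and the column names (`input`, `reference`, `output`) follow the hallucination template's expected variables:

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Each row pairs a model answer with the question and retrieved context.
df = pd.DataFrame(
    {
        "input": ["Who wrote Hamlet?"],
        "reference": ["Hamlet is a tragedy written by William Shakespeare."],
        "output": ["Hamlet was written by Charles Dickens."],
    }
)

# The judge model labels each row with one of the template's rails
# ("hallucinated" / "factual"); explanations are optional.
results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),  # judge model is an assumption
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(results[["label", "explanation"]])
```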
Phoenix (from Arize) is an open-source AI observability platform for tracing, evaluating, and debugging LLM apps. 9.1K+ GitHub stars. Core pieces: OpenTelemetry-based tracing, evals, prompt management.
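Starting the local UI is a one-liner; a sketch assuming the `arize-phoenix` package is installed and run from a notebook or script:

```python
import phoenix as px

# Launches the Phoenix server and UI locally (default: http://localhost:6006).
session = px.launch_app()
print(session.url)
```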
Phoenix instruments OpenAI, Anthropic, LangChain, LlamaIndex, and CrewAI via OpenInference instrumentors, so traces are captured with no per-call code changes. Traces flow to the local UI or to Arize cloud.
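A sketch of the one-time setup for auto-instrumenting OpenAI calls, assuming a locally running Phoenix plus the `openinference-instrumentation-openai` package; the project name is an illustrative placeholder:

```python
from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Point an OpenTelemetry tracer provider at the local Phoenix collector.
tracer_provider = register(project_name="demo-app")  # project name is illustrative

# One-time instrumentation: every OpenAI call after this line is traced,
# with no changes at individual call sites.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

The instrumentor patches the client library once at startup, which is what makes the "no per-call code changes" claim work in practice.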